1.1. Regression Models for Predicting Number of Views and Likes per Video¶
This notebook fits regression models to predict the target variables in the datasets stored in data/processed/.
import os
import torch
import joblib
import warnings
import numpy as np
import pandas as pd
import torch.nn as nn
import plotly.io as pio
import plotly.express as px
import plotly.graph_objects as go
from xgboost import XGBRegressor
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from torch.utils.data import TensorDataset, DataLoader
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from youtube_trends.config import PROCESSED_DATA_DIR, MODELS_DIR
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
pio.renderers.default = 'notebook_connected'
warnings.filterwarnings('ignore')
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
2025-05-25 04:59:47.062 | INFO | youtube_trends.config:<module>:11 - PROJ_ROOT path is: C:\Users\eddel\OneDrive\Documents\MCD\AAA\youtube_trends\venv\src\youtube-trends
Preparing training, validation, and testing datasets¶
Loading processed datasets df_train, df_val and df_test.
df_train = pd.read_csv(PROCESSED_DATA_DIR / 'train_dataset.csv', low_memory=False)
df_val = pd.read_csv(PROCESSED_DATA_DIR / 'val_dataset.csv', low_memory=False)
df_test = pd.read_csv(PROCESSED_DATA_DIR / 'test_dataset.csv', low_memory=False)
print(df_train.shape)
print(df_val.shape)
print(df_test.shape)
(46337, 199)
(7626, 199)
(6736, 199)
Feature selection.
drop_cols = ['video_view_count', 'video_like_count', 'days_to_trend', 'video_published_at']
X_train = df_train.drop(columns=drop_cols)
X_val = df_val.drop(columns=drop_cols)
X_test = df_test.drop(columns=drop_cols)
Only numeric columns are retained for the regression models. If --translate=True was specified during dataset processing, the video_title_translated column is available for interpretation, but it is not used by the models in this notebook.
X_train = X_train.select_dtypes(include=np.number)
X_val = X_val.select_dtypes(include=np.number)
X_test = X_test.select_dtypes(include=np.number)
X_train.describe()
| | video_duration | video_comment_count | channel_view_count | channel_subscriber_count | published_dayofweek | published_hour | video_title_length | video_tag_count | sentiment_score | sentiment_negative | ... | years | lang_pca_0 | lang_pca_1 | lang_pca_2 | lang_pca_3 | lang_pca_4 | video_category_pca_0 | video_category_pca_1 | video_category_pca_2 | video_category_pca_3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 46337.000000 | 46337.000000 | 4.633700e+04 | 4.633700e+04 | 46337.000000 | 46337.000000 | 46337.000000 | 46337.000000 | 46337.000000 | 46337.000000 | ... | 46337.000000 | 4.633700e+04 | 4.633700e+04 | 4.633700e+04 | 4.633700e+04 | 4.633700e+04 | 4.633700e+04 | 4.633700e+04 | 4.633700e+04 | 4.633700e+04 |
| mean | 1012.372273 | 3836.278438 | 6.040277e+09 | 1.197975e+07 | 3.197898 | 12.679932 | 9.143212 | 0.655329 | 0.049619 | 0.144204 | ... | 0.005082 | -7.851132e-17 | -3.680218e-17 | 1.242074e-17 | -3.291112e-17 | 3.649070e-17 | 2.576153e-17 | 4.968294e-17 | 2.453479e-17 | 2.085457e-17 |
| std | 2819.329071 | 8609.104544 | 1.578495e+10 | 3.989381e+07 | 1.958441 | 5.814503 | 4.036997 | 0.493445 | 0.281998 | 0.351301 | ... | 0.064020 | 4.589096e-01 | 2.054438e-01 | 1.585670e-01 | 1.435839e-01 | 1.414110e-01 | 4.796688e-01 | 3.860894e-01 | 3.538891e-01 | 3.258885e-01 |
| min | 10.000000 | 0.000000 | 2.163470e+05 | 5.600000e+01 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | -0.897900 | 0.000000 | ... | 0.000000 | -8.912856e-01 | -3.484205e-01 | -2.960471e-01 | -5.788040e-01 | -5.823643e-01 | -3.966943e-01 | -7.042180e-01 | -4.341486e-01 | -6.524535e-01 |
| 25% | 36.000000 | 335.000000 | 2.397540e+08 | 6.100000e+05 | 1.000000 | 9.000000 | 6.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | -6.937320e-01 | 7.831350e-03 | 1.800121e-03 | 2.321304e-04 | -2.709102e-04 | -3.926991e-01 | -7.191368e-03 | -4.194579e-01 | -3.912901e-02 |
| 50% | 160.000000 | 1138.000000 | 1.181903e+09 | 2.650000e+06 | 3.000000 | 13.000000 | 9.000000 | 1.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 2.746552e-01 | 7.831350e-03 | 1.800121e-03 | 2.321304e-04 | -2.709102e-04 | -2.536230e-01 | -2.479025e-03 | -6.956865e-02 | -1.005693e-02 |
| 75% | 984.000000 | 3205.000000 | 4.687050e+09 | 1.090000e+07 | 5.000000 | 17.000000 | 12.000000 | 1.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 2.746552e-01 | 7.831350e-03 | 1.800121e-03 | 2.321304e-04 | -2.709102e-04 | 7.745709e-01 | 2.370603e-03 | 1.999355e-01 | 1.919796e-02 |
| max | 42901.000000 | 82964.000000 | 2.970556e+11 | 3.960000e+08 | 6.000000 | 23.000000 | 25.000000 | 4.000000 | 0.944600 | 1.000000 | ... | 1.000000 | 2.746552e-01 | 8.052141e-01 | 8.589263e-01 | 7.906126e-01 | 7.749721e-01 | 7.745709e-01 | 7.099129e-01 | 6.165521e-01 | 7.542877e-01 |
8 rows × 195 columns
Standardization of the numeric features.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)
X_test_scaled = scaler.transform(X_test)
joblib.dump(scaler, MODELS_DIR / 'scaler_regression.pkl')
['C:\\Users\\eddel\\OneDrive\\Documents\\MCD\\AAA\\youtube_trends\\venv\\src\\youtube-trends\\models\\scaler_regression.pkl']
Target features¶
y_train_dict = {'video_view_count': df_train['video_view_count'], 'video_like_count': df_train['video_like_count']}
y_val_dict = {'video_view_count': df_val['video_view_count'], 'video_like_count': df_val['video_like_count']}
y_test_dict = {'video_view_count': df_test['video_view_count'], 'video_like_count': df_test['video_like_count']}
Note: Since the number of likes per video is strongly correlated with the number of views, we assume an explicit dependency between these features: the number of likes depends on the number of views, with likes less than or equal to views. This assumption rests on the premise that YouTube counts any play of a video as a view, regardless of how long it was played. It may not hold exactly, since YouTube counts views based on a combination of factors, including how many times a video is watched, the duration of each view, and user interactions. While the exact algorithm is not disclosed, a view generally requires the viewer to watch a portion of the video, often a minimum of 30 seconds, and to actively engage with the content.
X_train_like = np.concatenate([X_train_scaled, df_train[['video_view_count']].values], axis=1)
X_val_like = np.concatenate([X_val_scaled, df_val[['video_view_count']].values], axis=1)
X_test_like = np.concatenate([X_test_scaled, df_test[['video_view_count']].values], axis=1)
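The models themselves do not enforce the likes ≤ views constraint assumed above. A minimal post-processing sketch (not applied in this notebook; the function name is illustrative) could clip predicted likes to the observed view count:

```python
import numpy as np

def clip_likes_to_views(pred_likes, views):
    """Enforce the assumption above: 0 <= likes <= views.

    Negative predictions are floored at zero, since like counts are
    non-negative by definition.
    """
    pred_likes = np.asarray(pred_likes, dtype=float)
    views = np.asarray(views, dtype=float)
    return np.clip(pred_likes, 0.0, views)

# Hypothetical raw model outputs vs. observed view counts
raw_preds = np.array([-50.0, 1200.0, 9000.0])
views = np.array([10000.0, 1000.0, 50000.0])
print(clip_likes_to_views(raw_preds, views))  # values become 0, 1000, 9000
```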
Regression models¶
To predict the number of views and likes, we use the following regression models.
model_classes = {
    'Linear_Regression': LinearRegression,
    'Ridge': Ridge,
    'Lasso': Lasso,
    'Decision_Tree': DecisionTreeRegressor,
    'Random_Forest': RandomForestRegressor,
    'XGBoost': XGBRegressor
}
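One caveat: instantiating each class with bare defaults (ModelClass()) leaves the tree-based models unseeded, so reruns can produce slightly different results. A hedged variant (the keyword values below are illustrative, not what this notebook used) would pass per-model parameters for reproducibility:

```python
# Illustrative per-model keyword arguments; the random_state values are arbitrary
model_params = {
    'Linear_Regression': {},
    'Ridge': {},
    'Lasso': {},
    'Decision_Tree': {'random_state': 42},
    'Random_Forest': {'random_state': 42, 'n_jobs': -1},
    'XGBoost': {'random_state': 42},
}

# Inside the training loop this would become:
# model_vv = ModelClass(**model_params[name])
```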
Training and visualization of predicted values versus actual values.
results = {}
results_test = {}
n_target = 0
for name, ModelClass in model_classes.items():
    results[name] = {}
    results_test[name] = {}

    # video_view_count
    model_vv = ModelClass()
    model_vv.fit(X_train_scaled, y_train_dict['video_view_count'])
    y_pred_vv = model_vv.predict(X_val_scaled)
    mse_vv = mean_squared_error(y_val_dict['video_view_count'], y_pred_vv)
    r2_vv = r2_score(y_val_dict['video_view_count'], y_pred_vv)
    results[name]['video_view_count'] = {'MSE': mse_vv, 'R^2': r2_vv}
    joblib.dump(model_vv, MODELS_DIR / f'{name}_views.pkl')

    # Evaluate on test set
    y_test_pred_vv = model_vv.predict(X_test_scaled)
    mse_test_vv = mean_squared_error(y_test_dict['video_view_count'], y_test_pred_vv)
    r2_test_vv = r2_score(y_test_dict['video_view_count'], y_test_pred_vv)
    results_test[name]['video_view_count'] = {'MSE': mse_test_vv, 'R^2': r2_test_vv}

    # video_like_count
    model_vl = ModelClass()
    model_vl.fit(X_train_like, y_train_dict['video_like_count'])
    y_pred_vl = model_vl.predict(X_val_like)
    mse_vl = mean_squared_error(y_val_dict['video_like_count'], y_pred_vl)
    r2_vl = r2_score(y_val_dict['video_like_count'], y_pred_vl)
    results[name]['video_like_count'] = {'MSE': mse_vl, 'R^2': r2_vl}
    joblib.dump(model_vl, MODELS_DIR / f'{name}_likes.pkl')

    # Evaluate on test set
    y_test_pred_vl = model_vl.predict(X_test_like)
    mse_test_vl = mean_squared_error(y_test_dict['video_like_count'], y_test_pred_vl)
    r2_test_vl = r2_score(y_test_dict['video_like_count'], y_test_pred_vl)
    results_test[name]['video_like_count'] = {'MSE': mse_test_vl, 'R^2': r2_test_vl}

    # --- Visualizations ---
    for target, y_pred, y_true in zip(
        ['video_view_count', 'video_like_count'],
        [y_pred_vv, y_pred_vl],
        [y_val_dict['video_view_count'], y_val_dict['video_like_count']]
    ):
        if n_target % 2 == 0:
            print(f'\nModel: {name}\n')
        fig = go.Figure()
        fig.add_trace(go.Scatter(x=y_true, y=y_pred, mode='markers', name='Predictions'))
        fig.add_trace(go.Scatter(x=y_true, y=y_true, mode='lines', name='Ideal'))
        fig.update_layout(
            title=f'{name} for {target} (Validation)',
            xaxis_title='True Values',
            yaxis_title='Predicted Values',
            template='plotly_white'
        )
        fig.show()
        os.makedirs('iframe_figures', exist_ok=True)  # ensure the output directory exists
        filename = f'figure_{100+n_target}.html'
        filepath = os.path.join('iframe_figures', filename)
        fig.write_html(filepath)
        n_target += 1
Model: Linear_Regression
Model: Ridge
Model: Lasso
Model: Decision_Tree
Model: Random_Forest
Model: XGBoost
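Each fitted model (and the scaler) is persisted with joblib above, so it can be reloaded later for inference. A minimal round-trip sketch of that workflow, using synthetic data and a temporary directory (the file names mirror the notebook's, but the paths and data here are illustrative, not MODELS_DIR):

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled training features and view counts
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, -2.0, 0.0, 1.5, 0.5]) + rng.normal(scale=0.1, size=100)

scaler = StandardScaler().fit(X)
model = Lasso(alpha=0.01).fit(scaler.transform(X), y)

with tempfile.TemporaryDirectory() as tmp:
    # Persist both artifacts, as the notebook does with MODELS_DIR
    joblib.dump(scaler, os.path.join(tmp, 'scaler_regression.pkl'))
    joblib.dump(model, os.path.join(tmp, 'Lasso_views.pkl'))

    # Later (e.g. in a serving script): reload and predict on new rows
    scaler_loaded = joblib.load(os.path.join(tmp, 'scaler_regression.pkl'))
    model_loaded = joblib.load(os.path.join(tmp, 'Lasso_views.pkl'))
    preds = model_loaded.predict(scaler_loaded.transform(X[:3]))

print(preds.shape)  # (3,)
```

Reloading through the same scaler is important: a model trained on standardized features produces meaningless predictions on raw inputs.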
Results visualization (MSE and R^2 metrics)¶
for name, targets in results.items():
    print(f"\nValidation Results for model {name}:\n")
    for target, metrics in targets.items():
        print(f"{target} — MSE: {metrics['MSE']:.2f}, R^2: {metrics['R^2']:.4f}")
Validation Results for model Linear_Regression:

video_view_count — MSE: 222940215398476.88, R^2: 0.5820
video_like_count — MSE: 94001055879.53, R^2: 0.9225

Validation Results for model Ridge:

video_view_count — MSE: 222909980452372.00, R^2: 0.5820
video_like_count — MSE: 94031282749.22, R^2: 0.9225

Validation Results for model Lasso:

video_view_count — MSE: 222886591771235.97, R^2: 0.5821
video_like_count — MSE: 94092235818.60, R^2: 0.9224

Validation Results for model Decision_Tree:

video_view_count — MSE: 530837949859836.94, R^2: 0.0047
video_like_count — MSE: 382052774613.48, R^2: 0.6850

Validation Results for model Random_Forest:

video_view_count — MSE: 436456809751407.44, R^2: 0.1816
video_like_count — MSE: 429269284810.79, R^2: 0.6460

Validation Results for model XGBoost:

video_view_count — MSE: 411265869880897.25, R^2: 0.2289
video_like_count — MSE: 406664651226.69, R^2: 0.6647
for name, targets in results_test.items():
    print(f"\nTest Results for model {name}:\n")
    for target, metrics in targets.items():
        print(f"{target} — MSE: {metrics['MSE']:.2f}, R^2: {metrics['R^2']:.4f}")
Test Results for model Linear_Regression:

video_view_count — MSE: 205935824731273.69, R^2: 0.0030
video_like_count — MSE: 89175911251.04, R^2: 0.4316

Test Results for model Ridge:

video_view_count — MSE: 151289262543448.03, R^2: 0.2675
video_like_count — MSE: 71174926521.43, R^2: 0.5464

Test Results for model Lasso:

video_view_count — MSE: 117290294973885.66, R^2: 0.4321
video_like_count — MSE: 62341365195.37, R^2: 0.6027

Test Results for model Decision_Tree:

video_view_count — MSE: 244023322976896.06, R^2: -0.1814
video_like_count — MSE: 60328202903.22, R^2: 0.6155

Test Results for model Random_Forest:

video_view_count — MSE: 158461357677578.53, R^2: 0.2328
video_like_count — MSE: 29899911543.11, R^2: 0.8094

Test Results for model XGBoost:

video_view_count — MSE: 133287252779230.42, R^2: 0.3547
video_like_count — MSE: 28241817707.84, R^2: 0.8200
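The raw MSE values above are hard to read at this scale; converting them to RMSE puts the error back in the target's original units (views or likes). A small sketch using one of the reported numbers:

```python
import math

def rmse_from_mse(mse):
    """RMSE is the square root of MSE, in the target's original units."""
    return math.sqrt(mse)

# Reported XGBoost test MSE for video_like_count, taken from the output above
mse_likes_xgb = 28241817707.84
print(f'{rmse_from_mse(mse_likes_xgb):,.0f}')  # roughly 168,053 likes
```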
for metric in ["MSE", "R^2"]:
    for target in ["video_view_count", "video_like_count"]:
        values = [results[model][target][metric] for model in model_classes]
        fig = go.Figure(data=[
            go.Bar(x=list(model_classes.keys()), y=values)
        ])
        fig.update_layout(
            title=f"Validation {metric.upper()} Comparison for {target}",
            xaxis_title="Model",
            yaxis_title=metric.upper(),
            template="plotly_white"
        )
        fig.show()
        filename = f'figure_{100+n_target}.html'
        filepath = os.path.join('iframe_figures', filename)
        fig.write_html(filepath)
        n_target += 1
for metric in ["MSE", "R^2"]:
    for target in ["video_view_count", "video_like_count"]:
        values = [results_test[model][target][metric] for model in model_classes]
        fig = go.Figure(data=[
            go.Bar(x=list(model_classes.keys()), y=values)
        ])
        fig.update_layout(
            title=f"Test {metric.upper()} Comparison for {target}",
            xaxis_title="Model",
            yaxis_title=metric.upper(),
            template="plotly_white"
        )
        fig.show()
        filename = f'figure_{100+n_target}.html'
        filepath = os.path.join('iframe_figures', filename)
        fig.write_html(filepath)
        n_target += 1
Conclusions¶
Based on the validation results, the best models for predicting the number of likes (video_like_count) are the linear models—specifically Ridge and Lasso regression—which achieved high $R^2$ values around 0.92, indicating excellent predictive performance. For predicting the number of views (video_view_count), although all models showed moderate performance, the Lasso regression slightly outperformed others with an $R^2$ of 0.5821. Therefore, Ridge and Lasso are the most reliable models for predicting likes, while Lasso is the most suitable choice for estimating views during validation.
Considering that the split into training, validation, and test data was based on the videos' publication date, and given how Ridge and Lasso hold up on the test dataset, the best single model for predicting both the number of views and likes is Lasso, even though XGBoost performed better on the test dataset. This is because, if we deploy a regression model to predict views and likes per video, we will in practice train it only on the training and validation datasets. In this notebook, the test dataset is used only to help choose between Ridge and Lasso.